1 Introduction

Generalized linear models (GLM) as the name implies are a generalization of the linear modeling framework to allow for the modeling of response variables (e.g. soil attributes) with non-normal distributions and heterogeneous variances. Whereas linear models are designed for predicting continuous soil properties such as clay content or soil temperature, GLM can be used to predict the presence/absence of argillic horizons (i.e. logistic regression) or counts of a plant species along a transect (i.e. Poisson regression). These generalizations greatly expand the applicability of the linear modeling framework, while still allowing for a similar fitting procedure and interpretation of the resulting models.

In the past in order to handle non-linearity and heterogeneous variances, transformations have been made to the response variable, such as the log(x). However, such transformations complicate the models interpretation because the results refer to the transformed scale (e.g. log(x)). These response transformations are not guaranteed to achieve both normality and constant variance simultaneously. GLM approaches transform the response, but also preserve the scale of the response, and provide separate functions to transform the mean response and variance, known as the link and variance functions respectively. So instead of looking like this:

\(f(y) = \beta_{0} + \beta_{1}x + \varepsilon\)

you get this:

\(g(\mu)\) or \(\eta = \beta_{0} + \beta_{1}x + \varepsilon\)

with \(g(\mu)\) or \(\eta\) symbolizing the link function.

Another alteration of the classical linear model is that with GLM the coefficients are estimated iteratively by maximum likelihood estimation instead of ordinary least squares. This results in the GLM minimizing the deviance, instead of the sum of squares. However, for the Gaussian (i.e. normal) distributions the deviance and sum of squares are equivalent.

2 Logistic Regression

Logistic regression is a specific type of GLM designed to model data that has a binomial distribution (i.e. presence/absence, yes/no, or proportional data), which in statistical learning parlance is considered a classification problem. For binomial data the logit link transform is generally used. The effect of the logit transform can be seen in the following figure. It creates a sigmoidal curve, which enhances the separation between the two groups. It also has the effect of ensuring that the values range between 0 and 1.

When comparing a simple linear model vs a simple logistic model we can see the effect of the logit transform on the relationship between the response and predictor variable. As before it follows a sigmoidal curve and prevents predictions from exceeding 0 and 1.

3 Examples

Example 1: Probability of Mollisols (Beaudette & O’Geen, 2009)

Example 1: Probability of Mollisols (Beaudette & O’Geen, 2009)

Example 2: Probability of Red Clay (Evans & Hartemink, 2014)

Example 2: Probability of Red Clay (Evans & Hartemink, 2014)

Example 3: Probability of Ponding (NRCS Unpublished)

Example 3: Probability of Ponding (NRCS Unpublished)

4 Exercise

Now that we’ve discussed some of the basic background GLM theory we’ll move on to a real exercise, and address any additional theory where it relates to specific steps in the modeling process. The examples selected for this chapter come from Joshua Tree National Park (JTNP)(i.e. CA794) in the Mojave desert. The problem tackled here is a familiar one: Where can I expect to find argillic horizons on fan piedmonts? Argillic horizons within the Mojave are typically found on fan remnants, which are a stable landform that is a remnant of the Pleistocene (Peterson, 1981). Despite the low relief of most fans, fan remnants are uplands in the sense that they generally don’t receive run-on or active deposition.

With this dataset we’ll encounter some challenges. To start with, fan piedmont landscapes typically have relatively little relief. Since most of our predictors will be derivatives of elevation, that won’t leave us with much to work with. Also, our elevation data comes from the USGS National Elevation dataset (NED), which provides considerably less detail than say LiDAR or IFSAR data (Shi et al., 2012). Lastly our pedon dataset like most in NASIS, hasn’t received near as much quality control as have the components. So we’ll need to wrangle some of the pedon data before we can analyze it. These are all typical problems encountered in any data analysis and should be good practice. Ideally, it would be more interesting to try and model individual soil series with argillic horizons, but due to some of the challenges previously mentioned it would be difficult with this dataset. However, at the end we’ll look at one simple approach to try and separate individual soil series with argillic horizons.

4.1 Load packages

To start, as always we need to load some extra packages. This will become a familiar routine every time you start R. Most of the basic functions we need to develop a logistic regression model are contained in base R, but the following contain some useful spatial and data manipulation functions. Believe it or not we will use all of them and more.

library(aqp) # specialized soil classes and functions
library(soilDB) # NASIS and SDA import functions
library(raster) # guess
library(rgdal) # spatial import
library(ggplot2) # graphing
library(tidyr) # data manipulation
library(caret) # classification and regression training
library(car) # additional regression tools

4.2 Read in data

Hopefully like all good soil scientists and ecological site specialists you enter your field data into NASIS. Better yet hopefully someone else did it for you! Once data are captured in NASIS it is much easier to import the data into R, extract the pieces you need, manipulate it, model it, etc. If it’s not entered into NASIS, it may as well not exist.

# pedons <- fetchNASIS()
githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/ch7_data.Rdata"
load(url(githubURL))

# Examine the makeup of the data we imported from NASIS
str(pedons, max.level = 2)
## Formal class 'SoilProfileCollection' [package "aqp"] with 7 slots
##   ..@ idcol     : chr "peiid"
##   ..@ depthcols : chr [1:2] "hzdept" "hzdepb"
##   ..@ metadata  :'data.frame':   1 obs. of  1 variable:
##   ..@ horizons  :'data.frame':   4990 obs. of  43 variables:
##   ..@ site      :'data.frame':   1168 obs. of  79 variables:
##   ..@ sp        :Formal class 'SpatialPoints' [package "sp"] with 3 slots
##   ..@ diagnostic:'data.frame':   2133 obs. of  4 variables:

5 Exploratory analysis

5.1 Data wrangling

Generally before we begin modeling you should spend some time exploring the data. By examining a simple summary we can quickly see the breakdown of how many argillic horizons we have. Unfortunately, odds are good that all the argillic horizons haven’t been consistently populated in the diagnostic horizon table like they should be. Luckily for us, the desert argillic horizons always pop up in the taxonomic name, so we can use pattern matching to extract it. By doing this we gain an additional 11 pedons with argillic horizons and are able to label the missing values (i.e. NA). At a minimum for modeling purposes we probably need 10 pedons of the target we’re interested in and a total of 100 observations overall.

# Check consistency of argillic horizon population

# get the site table
s <- site(pedons) 

# tabulate the number of argillic horizons observed
table(s$argillic.horizon, useNA = "ifany") 
## 
## FALSE  TRUE  <NA> 
##   750   272   146
# or

# summary(s$argillic.horizon) 

# Extract argillic presence from the taxonomic subgroup
s$argillic <- grepl("arg", s$tax_subgroup)

table(s$argillic, useNA = "ifany")
## 
## FALSE  TRUE 
##   886   282

Ideally, if the diagnostic horizon table had been populated consistently we could have used the upper depth to diagnostic feature to filter out argillic horizons that start below 50cm, which may not be representative of “good” argillic horizons and may therefore have gotten correlated to a Torripsamments anyway. Not only are unrepresentative sites confusing for scientists, they’re equally confusing for models. However, as we saw earlier, some pedons don’t appear to be fully populated, so we’ll stick with those pedons that have the argillic specified in their taxonomic subgroup name, since it gives us the biggest sample.

d <- diagnostic_hz(pedons)
peiid <- subset(d, diag_kind == "argillic horizon" & featdept < 50, select = peiid)
test <- s$peiid %in% unique(peiid)
summary(test)
##    Mode   FALSE 
## logical    1168

5.2 Geomorphic data

Another obvious place to look is at the geomorphic data in the site table. This information is intended to help differentiate where our soil observations exist on the landscape. If populated consistently it could potentially be used in future disaggregation efforts, as demonstrated by Nauman and Thompson (2014).

# Landform vs argillic presence

# Subset
s_sub <- subset(s, argillic == TRUE)

# Cross tabulate landform vs argillic horizon presence
test <- with(s_sub, 
             table(landform.string, argillic, useNA = "ifany")
             )
# Subset and print landform.string with > 3 observations
test[test > 3,]
##            alluvial fan               fan apron fan apron & fan remnant 
##                       9                      27                      25 
##             fan remnant                    hill               hillslope 
##                     109                      15                      32 
##                low hill                mountain          mountain slope 
##                       5                       6                       8 
##                pediment                    <NA> 
##                      11                       6
# generalize the landform.string
s$landform <- ifelse(grepl("fan|terrace|sheet|drainageway|wash", s$landform.string), "fan", "hill") 

Examining the above frequency table we can see that argillic horizons occur predominantly on fan remnants as was alluded too earlier. However, they also seem to occur frequently on other landforms - some of which are curious combinations of landforms or redundant terms.

# Hillslope position

# Subset fan landforms
s_sub <- subset(s, landform == "fan") 

# Cross tabulate and calculate proportions, the "2" calculates the proportions relative to the column totals
with(s_sub, round(
  prop.table(table(hillslope_pos, argillic, useNA = "ifany"), 2)
  * 100)
  ) 
##              argillic
## hillslope_pos FALSE TRUE
##     Toeslope     18    5
##     Footslope     2    2
##     Backslope    15   16
##     Shoulder      3    4
##     Summit       15   32
##     <NA>         47   40
# Slope shape
with(s_sub, round(
  prop.table(table(paste(shapedown, shapeacross), argillic, useNA = "ifany"), 2)
  * 100)
  )
##                  argillic
##                   FALSE TRUE
##   Concave Concave     1    1
##   Concave Convex      0    0
##   Concave Linear      4    1
##   Convex Concave      0    0
##   Convex Convex       8    8
##   Convex Linear       7    7
##   Linear Concave      7    3
##   Linear Convex      20   29
##   Linear Linear      41   45
##   Linear NA           0    0
##   NA NA              12    8

Looking at the hillslope position of fan landforms we can see a slightly higher proportion of argillic horizons are found on summits, while less are found on toeslopes. Slope shape doesn’t seem to provide any useful information for distinguishing argillic horizons.

# Surface morphometry, depth and surface rock fragments

# Recalculate gravel
s$surface_gravel <- with(s, 
                         surface_gravel - surface_fgravel
                         )
# Calculate the total surface rock fragments
s$frags <- apply(s[grepl("surface", names(s))], 1, sum) 

# Subset to just look and fans, and select numeric columns
s_sub <- subset(s, landform == "fan", select = c(argillic, bedrckdepth, slope_field, elev_field, frags)) 

# convert s_sub to wide data format
s_w <- gather(s_sub, key = key, value = value, - argillic) 
head(s_w, 2)
##   argillic         key value
## 1    FALSE bedrckdepth    NA
## 2    FALSE bedrckdepth    11
ggplot(s_w, aes(x = argillic, y = value)) +
  geom_boxplot() +
  facet_wrap(~ key, scale = "free")

Looking at our numeric variables only depth to bedrock seems to show much separation between the presence/absence of argillic horizons.

5.3 Soil Scientist Bias

Next we’ll look at soil scientist bias. The question being: Are some soil scientists more likely to describe argillic horizons than others? Due to the excess number of soil scientist that have worked on CA794, including detailees, we’ve filtered the names of soil scientist to include just the top 3 mappers and given priority to the most senior soil scientists when they occur together.

# Custom function to filter out the top 3 soil scientists
s <- within(s, {
  old = describer
  describer2 = NA
  describer2[grepl("Stephen", old)] = "Stephen" # least senior
  describer2[grepl("Paul",    old)] = "Paul"
  describer2[grepl("Peter",   old)] = "Peter"   # most senior
  })
s_sub <- subset(s, landform == "fan")

# By frequency
with(s_sub, table(describer2, argillic, useNA = "ifany"))
##           argillic
## describer2 FALSE TRUE
##    Paul       64   28
##    Peter     208   86
##    Stephen    58   19
##    <NA>      134   58
# By proportion
with(s_sub, round(
  prop.table(table(describer2, argillic), margin = 1)
  * 100)
  )
##           argillic
## describer2 FALSE TRUE
##    Paul       70   30
##    Peter      71   29
##    Stephen    75   25

For fan landforms, none of the soil scientists seem more likely than the others to describe argillic horizons. However while this information is suggestive, it is far from definitive in showing a potential bias because it doesn’t take into account other factors. We’ll examine this more closely later.

5.4 Plot coordinates

Where do our points plot? We can plot the general location in R, but for this task we will export them to a Shapefile, so we can view them in a proper GIS, and really inspect them. Notice in the figure below the number of points that fall outside the survey boundary. What it doesn’t show is the points that may plot in the Ocean or Mexico!

# Convert soil profile collection to a spatial object
pedons2 <- pedons
slot(pedons2, "site") <- s # this is dangerous, but something needs to be fixed in the site() setter function
idx <- complete.cases(site(pedons2)[c("x", "y")]) # create an index to filter out pedons with missing coordinates
pedons2 <- pedons2[idx]
coordinates(pedons2) <- ~ x + y # set the coordinates
proj4string(pedons2) <- CRS("+init=epsg:4326") # set the projection
pedons_sp <- as(pedons2, "SpatialPointsDataFrame") # coerce to spatial object
pedons_sp <- spTransform(pedons_sp, CRS("+init=epsg:5070")) # reproject

# Read in soil survey area boundaries
# ssa <- readOGR(dsn = "F:/geodata/soils/soilsa_a_nrcs.shp", layer = "soilsa_a_nrcs")
# ca794 <- subset(ssa, areasymbol == "CA794") # subset out Joshua Tree National Park
# ca794 <- spTransform(ca794, CRS("+init=epsg:5070"))

# Plot
plot(ca794, axes = TRUE)
plot(pedons_sp, add = TRUE) # notice the points outside the boundary

# Write shapefile of pedons
writeOGR(pedons_sp, dsn = "C:/workspace2", "pedons_sp", driver = "ESRI Shapefile", overwrite_layer = TRUE)

5.4.1 Exercise 1: View the data in ArcGIS

  • Examine the shapefile in ArcGIS along with our potential predictive variables (hint classify the Shapefile symbology using the argillic horizon column)
  • Discuss with your group, and report your observations or hypotheses

5.5 Extracting spatial data

Prior to any spatial analysis or modeling, you will need to develop a suite of geodata files that can be intersected with your field data locations. This is, in and of itself a difficult task, and should be facilitated by your Regional GIS Specialist. Typically, these geodata files would primarily consist of derivatives from a DEM or satellite imagery. Prior to any prediction it is also necessary to ensure the geodata files have the same projection, extent, and cell size. Once we have the necessary files we can construct a list in R of the file names and paths, read the geodata into R, and then extract the geodata values where they intersect with field data.

# set file path
folder <- "D:/geodata/project_data/R8-VIC/ca794/"

# list of file names
files <- c(
  elev   = "ned30m_8VIC.tif", # elevation
  slope  = "ned30m_8VIC_slope5.tif", # slope gradient
  aspect = "ned30m_8VIC_aspect5.tif", # slope aspect
  twi    = "ned30m_8VIC_wetness.tif", # topographic wetness index
  twi_sc = "ned30m_8VIC_wetness_sc.tif", # transformed twi
  ch     = "ned30m_8VIC_cheight.tif", # catchment height
  z2str  = "ned30m_8VIC_z2stream.tif", # height above streams
  mrrtf  = "ned30m_8VIC_mrrtf.tif", # multiresolution ridgetop flatness index
  mrvbf  = "ned30m_8VIC_mrvbf.tif", # multiresolution valley bottom flatness index
  solar  = "ned30m_8VIC_solar.tif", # solar radiation
  precip = "prism30m_8VIC_ppt_1981_2010_annual_mm.tif", # annual precipitation
  precipsum = "prism30m_8VIC_ppt_1981_2010_summer_mm.tif", # summer precipitation
  temp   = "prism30m_8VIC_tmean_1981_2010_annual_C.tif", # annual temperature
  ls     = "landsat30m_8VIC_b123457.tif", # landsat bands
  pc     = "landsat30m_8VIC_pc123456.tif", # principal components of landsat
  tc     = "landsat30m_8VIC_tc123.tif", # tasseled cap components of landsat
  k      = "gamma30m_8VIC_namrad_k.tif", # gamma radiometrics signatures
  th     = "gamma30m_8VIC_namrad_th.tif",
  u      = "gamma30m_8VIC_namrad_u.tif",
  cluster = "cluster152.tif" # unsupervised classification
  )

# combine the folder directory and file names
geodata_f <- paste0(folder, files) 
names(geodata_f) <- names(files)

# Create a raster stack
geodata_r <- stack(geodata_f)

# Extract the geodata and add to a data frame
data <- raster::extract(geodata_r, pedons_sp, sp = TRUE)@data

# Modify some of the geodata variables
idx <- aggregate(mast ~ cluster, data = data, mean, na.rm = TRUE)
names(idx)[2] <- "cluster_mast"
data <- merge(data, idx, by = "cluster", all.x =  TRUE)

data <- within(data, {
  mast = temp - 4
  cluster  = factor(cluster, levels = 1:15)
  cluster2 = reorder(cluster, cluster_mast)
  gsi      = (ls_3 - ls_1) / (ls_3 + ls_2 + ls_1)
  ndvi     = (ls_4 - ls_3) / (ls_4 + ls_3)
  sw       = cos(aspect - 255)
  twi_sc   = abs(twi - 13.8) # 13.8 = twi median
  })

# save(data, ca794, pedons, file = "C:/workspace2/ch7_data.Rdata")

# Strip out location and personal information before uploading to the internet
# s[c("describer", "describer2", "x", "y", "x_std", "y_std", "utmnorthing", "utmeasting", "classifier")] <- NA
# slot(pedons, "site") <- s
# data[c("describer2", "x_std", "y_std")] <- NA
# save(data, ca794, pedons, file = "C:/workspace2/stats_for_soil_survey/trunk/data/ch7_data.Rdata")

5.6 Examine spatial data

With our spatial data in hand, we can now see whether any of the variables will help us separate the presence/absence of argillic horizons. Because we’re dealing with a classification problem, we’ll compare the numeric variables using boxplots. What we’re looking for are variables with the least amount of overlap in their distribution (i.e. the greatest separation in their median values).

# Load data
load(file = "C:/workspace2/github/stats_for_soil_survey/trunk/data/ch7_data.Rdata")
train <- data

# Select argillic horizons with "arg" in the subgroup name and on fans
# Argillic horizons that occur on hills and mountains more than likely form by different process, and therefore would require a different model.train$argillic 
train$argillic <- ifelse(grepl("arg", train$tax_subgroup) & 
                           train$mrvbf > 0.15,
                         TRUE, FALSE
                         )
train <- subset(train, !is.na(argillic), select = - c(pedon_id, taxonname, x_std, y_std, landform.string, cluster, cluster_mast, argillic.horizon, tax_subgroup, frags)) 

train2 <- subset(train, select = - c(describer2, landform, cluster2))
data_m <- gather(train2, key = key, value = value, - argillic)

ggplot(data_m, aes(x = argillic, y = value)) +
  geom_boxplot() +
  facet_wrap(~ key, scales = "free")

6 Modeling

Modeling is an iterative process that cycles between fitting and evaluating alternative models. Compared to tree and forest models, linear and generalized models require more input from the user. Automated model selection procedures are available, but are discouraged because they generally result in complex and unstable models. This is in part due to correlation amongst the predictive variables that can confuse the model. In addition, the order is which the variables are included or excluded from the model effects the significance of the others, and thus several weak predictors might mask the effect of one strong predictor. Therefore, it is best to begin with a selection of predictors that are known to be useful, and grow the model incrementally.

6.1 Variable Selection

The example below is known as a forward selection procedure, where a full model is fit and compared against a null model, to assess the effect of the different predictors. For testing alternative models the Akaike’s Information Criterion (AIC) is used. When using the AIC to assess predictor significance, a smaller number is better.

full <- glm(argillic ~ ., data = train, family = binomial) # "~ ." includes all columns in the data set
null <- glm(argillic ~ 1, data = train, family = binomial) # "~ 1" just includes an intercept

add1(null, full, test = "Chisq") # using the AIC test the effect of adding additional predictors, generally select the predictor with the smallest AIC unless it goes against your intuition
## Single term additions
## 
## Model:
## argillic ~ 1
##            Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>          872.18 891.63                      
## describer2  3   853.25 878.70  18.932 0.0002824 ***
## landform    1   768.81 790.26 103.377 < 2.2e-16 ***
## elev        1   869.79 891.24   2.398 0.1215117    
## slope       1   704.62 726.07 167.568 < 2.2e-16 ***
## aspect      1   869.85 891.30   2.328 0.1270350    
## twi         1   837.20 858.65  34.983 3.327e-09 ***
## twi_sc      1   690.72 712.17 181.463 < 2.2e-16 ***
## ch          1   870.81 892.26   1.370 0.2418957    
## z2str       1   805.75 827.20  66.429 3.628e-16 ***
## mrrtf       1   823.82 845.27  48.365 3.538e-12 ***
## mrvbf       1   819.41 840.86  52.775 3.740e-13 ***
## solar       1   861.83 883.28  10.353 0.0012926 ** 
## precip      1   864.20 885.65   7.986 0.0047148 ** 
## precipsum   1   854.28 875.73  17.901 2.327e-05 ***
## temp        1   870.08 891.53   2.102 0.1470626    
## ls_1        1   871.93 893.38   0.251 0.6163244    
## ls_2        1   869.79 891.24   2.390 0.1220857    
## ls_3        1   864.88 886.33   7.300 0.0068948 ** 
## ls_4        1   860.60 882.05  11.579 0.0006670 ***
## ls_5        1   846.15 867.60  26.034 3.355e-07 ***
## ls_6        1   847.19 868.64  24.997 5.741e-07 ***
## pc_1        1   855.78 877.23  16.399 5.132e-05 ***
## pc_2        1   836.95 858.40  35.233 2.926e-09 ***
## pc_3        1   858.12 879.57  14.061 0.0001769 ***
## pc_4        1   869.35 890.80   2.828 0.0926066 .  
## pc_5        1   872.17 893.62   0.015 0.9017456    
## pc_6        1   863.85 885.30   8.338 0.0038833 ** 
## tc_1        1   861.58 883.03  10.606 0.0011273 ** 
## tc_2        1   869.32 890.77   2.863 0.0906327 .  
## tc_3        1   836.92 858.37  35.260 2.886e-09 ***
## k           1   870.13 891.58   2.051 0.1521487    
## th          1   871.26 892.71   0.920 0.3374110    
## u           1   872.02 893.47   0.162 0.6877600    
## mast        1   870.08 891.53   2.102 0.1470626    
## cluster2   12   760.01 803.46 112.172 < 2.2e-16 ***
## gsi         1   835.26 856.71  36.921 1.230e-09 ***
## ndvi        1   868.77 890.22   3.409 0.0648437 .  
## sw          1   872.16 893.61   0.023 0.8802474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see as the boxplots showed earlier that twi_sc has the smallest AIC and reduces the deviances the most. So let’s add twi_sc to the null model using the update() function. Then continue using the add1() or drop1() functions, until the model is saturated.

argi.glm <- update(null, . ~ . + twi_sc) # add twi_sc to the model, "-" will subtract predictors

# or refit

# argi.glm <- glm(argillic ~ twi_sc, data = train, family = binomial)

# add1(argi.glm, full, test = "Chisq") # iterate until the model is saturated

# drop1(argi.glm, test = "Chisq") # test effect of dropping a predictor

argi.glm <- glm(argillic ~ twi_sc + slope + ls_1 + ch + z2str + mrvbf, data = train, family = binomial)

# Examine the effect and error for each predictors
summary(argi.glm) 
## 
## Call:
## glm(formula = argillic ~ twi_sc + slope + ls_1 + ch + z2str + 
##     mrvbf, family = binomial, data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.52678  -0.56621  -0.11136  -0.00255   2.87734  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.951983   0.810486   4.876 1.08e-06 ***
## twi_sc      -0.630857   0.097191  -6.491 8.53e-11 ***
## slope       -0.248590   0.053932  -4.609 4.04e-06 ***
## ls_1        -0.024334   0.009363  -2.599  0.00935 ** 
## ch          -0.002913   0.001175  -2.479  0.01318 *  
## z2str       -0.020934   0.006033  -3.470  0.00052 ***
## mrvbf       -0.285661   0.133716  -2.136  0.03265 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 875.83  on 989  degrees of freedom
## Residual deviance: 600.54  on 983  degrees of freedom
##   (40 observations deleted due to missingness)
## AIC: 614.54
## 
## Number of Fisher Scoring iterations: 8
# Convert the coefficients to an odds scale, who here gambles?
exp(coef(argi.glm))
## (Intercept)      twi_sc       slope        ls_1          ch       z2str 
##  52.0384651   0.5321354   0.7798994   0.9759592   0.9970909   0.9792839 
##       mrvbf 
##   0.7515171
# Importance of each predictor assessed by the amount of deviance they explain
anova(argi.glm) 
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: argillic
## 
## Terms added sequentially (first to last)
## 
## 
##        Df Deviance Resid. Df Resid. Dev
## NULL                     989     875.83
## twi_sc  1  182.883       988     692.95
## slope   1   54.734       987     638.22
## ls_1    1   18.532       986     619.68
## ch      1    3.815       985     615.87
## z2str   1   10.608       984     605.26
## mrvbf   1    4.719       983     600.54

After the model is saturated you should end up with a model similar to the one above.

6.2 Residual Plots

After we’re satisfied no additional variables will improve the fit, we need to evaluate it’s residuals, collinearity, accuracy, and model coefficients.

# Residual Plots for GLM
par(mfrow = c(2, 2))
plot(argi.glm)

residualPlots(argi.glm, fit = FALSE, tests = FALSE)

6.3 Multicolinearity

The variance inflation factor (VIF) is used to assess collinearity amongst the predictors. Its square root indicates the amount of increase in the predictor coefficients standard error. A value greater than 2 indicates a doubling the standard error. Rules of thumb vary, but a square root of vif greater than 2 or 3 indicates an unacceptable value.

# Variance inflation, greater than 5 or 10 is bad
vif(argi.glm)
##   twi_sc    slope     ls_1       ch    z2str    mrvbf 
## 1.100552 1.684762 1.170740 1.163639 1.373601 1.879715

6.4 Accuracy Assement

Because we’re dealing with a classification problem, we have to consider both errors of commission (Type I) and omission (Type II), or their corresponding accuracies of sensitivity (producer’s accuracy) and positive predicted value (user’s accuracy or precision) respectively. Before we can assess the error, however, we need to select a probability threshold.

  • Sensitivity and specificity examine how well the ground truth or reference data compares to the predictions.
  • Positive and negative predicted values (user’s accuracy) examine the inverse concept of how well the predictions match the reference data
# examine possible thresholds
train$predict <- predict(argi.glm, train, type = "response")

ggplot(train, aes(x = predict, fill = argillic)) +
  geom_density(alpha = 0.5) +
  geom_vline(aes(xintercept = 0.5), lty = "dashed") +
  xlab("probability") +
  scale_x_continuous(breaks = seq(0, 1, 0.2))

train$predict <- train$predict > 0.35

# Confusion Matrix
cm <- table(predicted = train$predict, observed = train$argillic)
confusionMatrix(cm, positive = "TRUE")
## Confusion Matrix and Statistics
## 
##          observed
## predicted FALSE TRUE
##     FALSE   733   67
##     TRUE     97   93
##                                          
##                Accuracy : 0.8343         
##                  95% CI : (0.8097, 0.857)
##     No Information Rate : 0.8384         
##     P-Value [Acc > NIR] : 0.65423        
##                                          
##                   Kappa : 0.4317         
##  Mcnemar's Test P-Value : 0.02354        
##                                          
##             Sensitivity : 0.58125        
##             Specificity : 0.88313        
##          Pos Pred Value : 0.48947        
##          Neg Pred Value : 0.91625        
##              Prevalence : 0.16162        
##          Detection Rate : 0.09394        
##    Detection Prevalence : 0.19192        
##       Balanced Accuracy : 0.73219        
##                                          
##        'Positive' Class : TRUE           
## 
# Deviance squared
library(modEvA)
Dsquared(argi.glm)
## [1] 0.314319
# Adjusted deviance squared
Dsquared(argi.glm, adjust = TRUE)
## [1] 0.3101338
  • Discuss the variability of the predictions across the clusters, perhaps different models need to be constructed in each cluster, some clusters appear to be dominated by specific soil series, these data aren’t clean enough (nor are the series concepts usually) to model series separately, however, we could use the clusters as an additional model to attempt to separate the series. Do the hyperthermic clusters perform differently.
library(dplyr)

temp <- subset(train, argillic == TRUE) %>%
  group_by(cluster2) %>%
  summarize(
    sum_arg = sum(argillic, na.rm = TRUE),
    sum_pred = sum(predict, na.rm = TRUE),
    sensitivity = round(sum(predict == argillic) / length(argillic), 2)
    )

ggplot(temp, aes(x = cluster2, y = sensitivity)) +
  geom_point()

# Remove outlier clusters         
train_sub <- subset(train, ! cluster2 %in% c(12, 7))

# full <- glm(argillic ~ ., data = train_sub, family = binomial)
# null <- glm(argillic ~ 1, data = train_sub, family = binomial)
# add1(null, full, train = "Chisq")

sub.glm <- glm(argillic ~ slope + twi_sc + ls_1 + mrvbf + z2str + ch, data = train_sub, family = binomial)

# summary(sub.glm)

train_sub$predict <- predict(sub.glm, train_sub, type = "response") > 0.35
cm <- table(predicted = train_sub$predict, observed = train_sub$argillic)
confusionMatrix(cm, positive = "TRUE")
## Confusion Matrix and Statistics
## 
##          observed
## predicted FALSE TRUE
##     FALSE   622   52
##     TRUE     81   90
##                                           
##                Accuracy : 0.8426          
##                  95% CI : (0.8163, 0.8665)
##     No Information Rate : 0.832           
##     P-Value [Acc > NIR] : 0.21826         
##                                           
##                   Kappa : 0.4795          
##  Mcnemar's Test P-Value : 0.01519         
##                                           
##             Sensitivity : 0.6338          
##             Specificity : 0.8848          
##          Pos Pred Value : 0.5263          
##          Neg Pred Value : 0.9228          
##              Prevalence : 0.1680          
##          Detection Rate : 0.1065          
##    Detection Prevalence : 0.2024          
##       Balanced Accuracy : 0.7593          
##                                           
##        'Positive' Class : TRUE            
## 
temp <- subset(train_sub, argillic == TRUE) %>%
  group_by(cluster2) %>%
  summarize(
    sum_arg  = sum(argillic, na.rm = TRUE),
    sum_pred = sum(predict, na.rm = TRUE),
    sensitivity = round(sum(predict == argillic) / length(argillic), 2)
    )

ggplot(temp, aes(x = cluster2, y = sensitivity)) +
  geom_point()

  • View the results in ArcGIS and examine the accuracy at individual points
  • Discuss the effects of data quality, including both NASIS and GIS
  • Discuss how the modeling process isn’t an end in itself, but serves to uncover trends, possibly generate additional questions and direct future investigations
# Custom function to return the predictions and their standard errors
predfun <- function(model, data) {
  v <- predict(model, data, type = "response", se.fit = TRUE)
  cbind(
    p = as.vector(v$fit),
    se = as.vector(v$se.fit)
    )
  }
  
# Generate spatial predictions
# r <- predict(geodata_r, argi.glm, fun = predfun, index = 1:2, progress = "text")

# Export the results
# writeRaster(r[[1]], "argi.tif", overwrite = T, progress = "text")
# writeRaster(r[[2]], "argi_se.tif", overwrite = T, progress = "text")

plot(raster("C:/workspace2/argi.tif"))
plot(ca794, add = TRUE)

plot(raster("C:/workspace2/argi_se.tif"))
plot(ca794, add = TRUE)

# Download clipped example from Pinto Basin Joshua Tree
githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/logistic/argi_pb.zip"
download.file(githubURL, destfile = "C:/workspace2/argi_pb.zip")
unzip(zipfile="C:/workspace2/argi_pb.zip", exdir="C:/workspace2")

6.4.1 Exercise 2: View the prediction in ArcGIS

  • Examine the raster predictions in ArcGIS and compare them to the Shapefile of that contains the original observations (hint classify the Shapefile symbology using the argillic column)
  • Discuss with your group, and report your observations or hypotheses

7 References

Beaudette, D. E., & O’Geen, A. T, 2009. Quantifying the aspect effect: an application of solar radiation modeling for soil survey. Soil Science Society of America Journal, 73:1345-1352

Gessler, P. E., Moore, I. D., McKenzie, N. J., & Ryan, P. J, 1995. Soil-landscape modelling and spatial prediction of soil attributes. International Journal of Geographical Information Systems, 9:421-432

Gorsevski, P. V., Gessler, P. E., Foltz, R. B., & Elliot, W. J, 2006. Spatial prediction of landslide hazard using logistic regression and ROC analysis. Transactions in GIS, 10:395-415

Evans, D.M. and Hartemink, A.E., 2014. Digital soil mapping of a red clay subsoil covered by loess. Geoderma, 230:296-304.

Hosmer Jr, D.W., Lemeshow, S. and Sturdivant, R.X., 2013. Applied logistic regression (Vol. 398). John Wiley & Sons

Kempen, B., Brus, D. J., Heuvelink, G., & Stoorvogel, J. J. (2009). Updating the 1: 50,000 Dutch soil map using legacy soil data: A multinomial logistic regression approach. Geoderma, 151:311-326.

Nauman, T. W., and J. A. Thompson, 2014. Semi-automated disaggregation of conventional soil maps using knowledge driven data mining and classification trees. Geoderma 213:385-399. http://www.sciencedirect.com/science/article/pii/S0016706113003066

Peterson, F.F., 1981. Landforms of the basin and range province: defined for soil survey. Nevada Agricultural Experiment Station Technical Bulletin 28, University of Nevada - Reno, NV. 52 p. http://jornada.nmsu.edu/files/Peterson_LandformsBasinRangeProvince.pdf

Shi, X., L. Girod, R. Long, R. DeKett, J. Philippe, and T. Burke, 2012. A comparison of LiDAR-based DEMs and USGS-sourced DEMs in terrain analysis for knowledge-based digital soil mapping. Geoderma 170:217-226. http://www.sciencedirect.com/science/article/pii/S0016706111003387

8 Additional reading

Lane, P.W., 2002. Generalized linear models in soil science. European Journal of Soil Science 53, 241- 251. http://onlinelibrary.wiley.com/doi/10.1046/j.1365-2389.2002.00440.x/abstract

James, G., D. Witten, T. Hastie, and R. Tibshirani, 2014. An Introduction to Statistical Learning: with Applications in R. Springer, New York. http://www-bcf.usc.edu/~gareth/ISL/

Hengl, T. 2009. A Practical Guide to Geostatistical Mapping, 2nd Edt. University of Amsterdam, www.lulu.com, 291 p. ISBN 978-90-9024981-0. http://spatial-analyst.net/book/system/files/Hengl_2009_GEOSTATe2c0w.pdf